Slovene-Croatian Treebank Transfer Using Bilingual Lexicon Improves Croatian Dependency Parsing

Authors

  • Željko Agić
  • Danijela Merkler
  • Daša Berović
Abstract

A method is presented for transferring dependency treebanks between similar languages by means of a bilingual lexicon, with the aim of improving dependency parsing accuracy on the target language. It is illustrated by transferring the Slovene Dependency Treebank to Croatian, using a bilingual lexicon built with GIZA++ from the Croatian-Slovene 1984 parallel corpus of the MULTEXT-East project. The transferred treebank is merged with the Croatian Dependency Treebank, and the merged treebank is used to train and test two graph-based dependency parsers, MSTParser and CroDep. Both parsers show a statistically significant increase in accuracy when parsing the fictional text of 1984, and a decrease of similar size when parsing the newspaper text of the Croatian Dependency Treebank.

Slovene title: Slovensko-hrvaški prenos drevesnic z uporabo dvojezičnega leksikona izboljša odvisnostno razčlenjevanje hrvaščine
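The abstract does not spell out the transfer procedure, so the following is only a minimal sketch of one plausible reading: each source-language word form is replaced by its lexicon translation while the dependency heads and labels stay untouched, and the result is appended to the target treebank. The function names, the `(form, head, label)` token layout, the labels, and the toy lexicon entries are all assumptions made for illustration; they are not taken from the paper or from the GIZA++ output.

```python
from typing import Dict, List, Tuple

# Hypothetical token layout: (word form, head index, dependency label).
# Head index 0 marks the root of the sentence.
Token = Tuple[str, int, str]
Sentence = List[Token]


def transfer_sentence(sentence: Sentence, lexicon: Dict[str, str]) -> Sentence:
    """Swap each source form for its lexicon translation.

    Forms missing from the lexicon are kept unchanged; heads and labels
    are copied as-is, so the tree structure is preserved.
    """
    return [
        (lexicon.get(form.lower(), form), head, label)
        for form, head, label in sentence
    ]


def merge_treebanks(target: List[Sentence], transferred: List[Sentence]) -> List[Sentence]:
    """Concatenate the target treebank with the transferred sentences."""
    return list(target) + list(transferred)


# Toy, hand-made lexicon entries (illustrative only).
toy_lexicon = {"pes": "pas", "hiša": "kuća"}

# One made-up Slovene sentence with made-up labels.
slovene_sentence: Sentence = [("pes", 2, "Sb"), ("spi", 0, "Pred")]

transferred_sentence = transfer_sentence(slovene_sentence, toy_lexicon)
merged = merge_treebanks([], [transferred_sentence])
```

Word-for-word substitution of this kind only makes sense for closely related languages with similar word order, which is the setting the abstract describes; forms without a lexicon entry simply pass through unchanged in this sketch.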

Similar resources

Universal Dependencies for Croatian (that work for Serbian, too)

We introduce a new dependency treebank for Croatian within the Universal Dependencies framework. We construct it on top of the SETIMES.HR corpus, augmenting the resource by additional part-of-speech and dependency-syntactic annotation layers adherent to the framework guidelines. In this contribution, we outline the treebank design choices, and we use the resource to benchmark dependency parsing...

K-Best Spanning Tree Dependency Parsing With Verb Valency Lexicon Reranking

A novel method for hybrid graph-based dependency parsing of natural language text is proposed. It is based on k-best maximum spanning tree dependency parsing and evaluation of the spanning trees by using a verb valency lexicon for a given language as a reranking knowledge base. The approach is compared with existing state-of-the-art transition-based and graph-based approaches to dependency pars...

COLING 2012: 24th International Conference on Computational Linguistics

A novel method for hybrid graph-based dependency parsing of natural language text is proposed. It is based on k-best maximum spanning tree dependency parsing and evaluation of the spanning trees by using a verb valency lexicon for a given language as a reranking knowledge base. The approach is compared with existing state-of-the-art transition-based and graph-based approaches to...

Croatian Dependency Treebank 2.0: New Annotation Guidelines for Improved Parsing

We present a new version of the Croatian Dependency Treebank. It constitutes a slight departure from the previously closely observed Prague Dependency Treebank syntactic layer annotation guidelines as we introduce a new subset of syntactic tags on top of the existing tagset. These new tags are used in explicit annotation of subordinate clauses via subordinate conjunctions. Introducing the new a...

Parsing Croatian and Serbian by Using Croatian Dependency Treebanks

We investigate statistical dependency parsing of two closely related languages, Croatian and Serbian. As these two morphologically complex languages of relaxed word order are generally under-resourced – with the topic of dependency parsing still largely unaddressed, especially for Serbian – we make use of the two available dependency treebanks of Croatian to produce state-of-the-art parsing mod...

Publication date: 2012